A quick search method for audio signals based on a piecewise linear representation of feature trajectories
This paper presents a new method for a quick similarity-based search through
long unlabeled audio streams to detect and locate audio clips provided by
users. The method involves feature-dimension reduction based on a piecewise
linear representation of a sequential feature trajectory extracted from a long
audio stream. Two techniques enable us to obtain a piecewise linear
representation: the dynamic segmentation of feature trajectories and the
segment-based Karhunen-Loève (KL) transform. In principle, the proposed search
method guarantees the same search results as a search without the proposed
feature-dimension reduction. Experimental results indicate
significant improvements in search speed. For example, the proposed method
reduced the total search time to approximately 1/12 that of previous methods
and detected queries in approximately 0.3 seconds from a 200-hour audio
database.
Comment: 20 pages, to appear in IEEE Transactions on Audio, Speech and
Language Processing
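The segment-based KL transform described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fixed-length segments in place of the paper's dynamic segmentation, and the names `segment_kl_transform`, `reconstruct`, `segment_len`, and `n_components` are hypothetical. Each segment of the feature trajectory is projected onto its own leading principal directions, which is the KL transform applied per segment.

```python
import numpy as np

def segment_kl_transform(features, segment_len, n_components):
    """Reduce a (T, D) feature trajectory segment by segment.

    Assumption: fixed-length segments stand in for dynamic segmentation.
    Returns a list of (start, mean, basis, coeffs) tuples, one per segment.
    """
    segments = []
    for start in range(0, len(features), segment_len):
        seg = features[start:start + segment_len]
        mean = seg.mean(axis=0)
        centered = seg - mean
        # Rows of vt are the principal directions of this segment.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:n_components]
        # Low-dimensional coordinates of each frame within the segment.
        coeffs = centered @ basis.T
        segments.append((start, mean, basis, coeffs))
    return segments

def reconstruct(segments):
    """Map the reduced coefficients back to the original feature space."""
    return np.vstack([coeffs @ basis + mean
                      for _, mean, basis, coeffs in segments])
```

When the trajectory is exactly linear within each segment, one component per segment reconstructs it without loss, which is the intuition behind a piecewise linear representation.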
Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation
General-purpose audio representations learned through self-supervised learning
have demonstrated high performance in a variety of tasks. Although they can be
adapted to an application by fine-tuning, even higher performance can be
expected if pre-training itself is specialized for the application. This paper
explores the challenges and solutions in specializing general-purpose audio
representations for a specific application using speech, a highly demanding
field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose
model, to close the performance gap with state-of-the-art (SOTA) speech models.
To do so, we propose a new task, denoising distillation, to learn from
fine-grained clustered features, and M2D for Speech (M2D-S), which jointly
learns the denoising distillation task and M2D masked prediction task.
Experimental results show that M2D-S performs comparably to or outperforms SOTA
speech models on the SUPERB benchmark, demonstrating that M2D can specialize in
a demanding field. Our code is available at:
https://github.com/nttcslab/m2d/tree/master/speech
Comment: Interspeech 2023; 5 pages, 2 figures, 6 tables, Code:
https://github.com/nttcslab/m2d/tree/master/speech
INTER-TRIAL DIFFERENCE ANALYSIS THROUGH APPEARANCE-BASED MOTION TRACKING
The purpose of this study is to develop a method for quantitative evaluation and visualization of inter-trial differences in the motion of athletes. Previous methods for kinematic analysis of human movement either require attaching specific equipment to a body segment or can only be used in an environment designed for analysis. Therefore, they are difficult to use for observing motions in real games. To enhance applicability to real-game situations, we propose appearance-based motion tracking. Our method requires only an image sequence from a camera. From the image sequence, trials are detected automatically and a difference analysis is conducted on them. We applied our method to the analysis of pitching motions in actual baseball games. Although we have no quantitative evaluation yet, the experimental results suggest the efficacy of our method.
Deep Attentive Time Warping
Measuring similarity between time series is an important problem in time series
classification. To handle nonlinear time distortions, Dynamic Time Warping
(DTW) has been widely used. However, DTW is not learnable and suffers from a
trade-off between robustness against time distortion and discriminative power.
In this paper, we propose a neural network model for task-adaptive time
warping. Specifically, we use the attention model, called the bipartite
attention model, to develop an explicit time warping mechanism with greater
distortion invariance. Unlike other learnable models using DTW for warping, our
model predicts all local correspondences between two time series and is trained
based on metric learning, which enables it to learn the optimal data-dependent
warping for the target task. We also propose guiding the pre-training of our
model with DTW to improve the discriminative power. Extensive experiments
demonstrate the superior effectiveness of our model over DTW and its
state-of-the-art performance in online signature verification.
Comment: Accepted at Pattern Recognition
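For reference, the classical DTW baseline that the abstract above compares against can be sketched as follows. This is the textbook dynamic-programming formulation for one-dimensional sequences with an absolute-difference local cost, not the proposed attention-based model; the function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Classical DTW distance between two 1-D sequences.

    cost[i][j] holds the minimum accumulated cost of aligning
    the first i elements of a with the first j elements of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

The recurrence is fixed and contains no trainable parameters, which is the sense in which the abstract calls DTW "not learnable".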